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Abstract 

For many expensive deterministic computer simulators, the outputs do not have replication 
error and the desired metamodel (or statistical emulator) is an interpolator of the observed 
data. Realizations of Gaussian spatial processes (GP) are commonly used to model such sim- 
ulator outputs. Fitting a GP model to n data points requires the computation of the inverse 
and determinant of n x n correlation matrices, R, that are sometimes computationally unstable 
due to near-singularity of R. This happens if any pair of design points are very close together 
in the input space. The popular approach to overcome near-singularity is to introduce a small 
nugget (or jitter) parameter in the model that is estimated along with other model parameters. 
The inclusion of a nugget in the model often causes unnecessary over-smoothing of the data. In 
this paper, we propose a lower bound on the nugget that minimizes the over-smoothing and an 
iterative regularization approach to construct a predictor that further improves the prediction 
accuracy. We also show that the proposed predictor converges to the GP interpolator. 

KEY WORDS: Computer experiment; Matrix inverse approximation; Regularization. 



1 Introduction 

Computer simulators are often used to model complex physical and engineering processes that 
are either too expensive or time consuming to observe. A simulator is said to be deterministic 
if the replicate runs of the same inputs will yield identical responses. For the last few decades, 
the deterministic simulators have been widely used to model physical processes. For instance, 
Kumar and Davidson (1978) used deterministic simulation models for comparing the performance 
of highly concurrent computers; Su et al. (1996) used generalized linear regression models to design 
a lamp filament via a deterministic finite-element computer code; Aslett et al. (1998) discuss an 
optimization problem for a deterministic circuit simulator; several deterministic simulators are 
being used for analyzing biochemical networks (see Bergmann and Sauro 2008 for references). On 
the other hand, there are cases where stochastic (non-deterministic) simulators are preferred due 
to unavoidable biases (e.g., Poole and Raftery 2000). In spite of the recent interest in stochastic 
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simulators, deterministic simulators are still being actively used. For instance, Medina, Moreno 
and Royo (2005) demonstrate the preference of deterministic traffic simulators over its stochastic 
counterpart. In this paper, we assume that the simulator under consideration is deterministic up 
to the working precision and the statistician is confident of the choice of the emulator. 

Sacks, Welch, Mitchell and Wynn (1989) proposed modeling (or emulating) such an expensive 
deterministic simulator as a realization of a Gaussian stochastic process (GP). An emulator of a 
deterministic simulator is desired to be an interpolator of the observed data (e.g.. Sacks et al. 
1989; Van Beers and Kleijnen 2004). For the problem that motivated this work, the objective is 
to emulate the average extractable tidal power as a function of the turbine locations in the Bay 
of Fundy, Nova Scotia, Canada. The computer simulator for the tidal power model is a numerical 
solver of a complex system of partial differential equations and is deterministic (i.e., we are not 
allowed to question the accuracy of the simulator itself). 

In this paper, we discuss a computational issue in building the GP based emulator for a determin- 
istic simulator. Fitting a GP model to n data points using either a maximum likelihood technique 
or a Bayesian approach requires the computation of the determinant and inverse of several n x n 
correlation matrices, R. Although the correlation matrices are positive definite by definition, near- 
singularity (also referred to as ill-conditioning) of these matrices is a common problem in fitting 
GP models. Ababou, Bagtzoglou and Wood (1994) study the relationship of a uniform grid to the 
ill-conditioning and quality of model fit for various covariance models. Barton and Salagame (1997) 
study the effect of experimental design to the ill-conditioning of kriging models. Jones, Schonlau 
and Welch (1998) used the singular value decomposition method to overcome the near-singularity 
of R. Booker (2000) used the sum of independent GPs to overcome near-singularity for multi-stage 
adaptive designs in kriging models. A more popular solution to overcome near-singularity is to 
introduce a nugget or jitter parameter, 6, in the model (e.g.. Sacks et al. 1989; Neal 1997; Booker 
et al. 1999; Santner, Williams and Notz 2003; Gramacy and Lee 2008) that is estimated along with 
other model parameters. However, adding a nugget to the model introduces additional smoothing 
in the predictor and as a result the predictor is no longer an interpolator. 

Here, we first propose a lower bound on the nugget ((5/fe) that minimizes the additional over- 
smoothing. Second, an iterative approach is developed to enable the construction of a new predictor 
that further improves the prediction accuracy. We also show that the proposed predictor converges 
to an interpolator. Although an arbitrary nugget (0 < 5 < 1) can be used in the iterative approach, 
the rate of convergence (i.e., the number of iterations required to reach certain tolerance) depends 



2 



on the magnitude of the nugget. To this effect, the proposed lower bound 6ib significantly reduces 
the number of iterations required. This feature is particularly desirable for implementation. 

The paper is organized as follows. Section [2] presents the tidal power modeling example. In 
Section [3l we review the GP model, a computational issue in fitting the model, and the popular 
approach to overcome near-singularity. Section [J] presents the new lower bound for the nugget that is 
required to achieve well-conditioned correlation matrices and minimize unnecessary over-smoothing. 
In Section [5l we develop the iterative approach for constructing a more accurate predictor. Several 
examples are presented in Section [6] to illustrate the performance of our proposed predictor over 
the one obtained using the popular approach. Finally, we conclude the paper with some remarks 
on the numerical issues and recommendations for practitioners in Section [71 

2 Motivating example 

The Bay of Fundy, located between New Brunswick and Nova Scotia, Canada, with a small portion 
touching Maine, USA, is world famous for its high tides. In the upper portion of the Bay of Fundy 
(see Figure [D^a)), the difference in water level between high tide and low tide can be as much as 
17 meters. The high tides in this region are a result of a resonance, with the natural period of the 
Bay of Fundy very close to the period of the principal lunar tide. This results in very regular tides 
in the Bay of Fundy with a high tide every 12.42 hours. The incredible energy in these tides has 
meant that the region has significant potential for extracting tidal power (Greenberg 1979; Karsten, 
McMillan, Lickley and Haynes 2008 (hereafter KMLH)). 

Though the notion of harnessing tidal power from the Bay of Fundy is not new, earlier proposed 
methods of harvesting the much needed green electrical energy involved building a barrage or dam. 
But this method was considered infeasible for a variety of economic and environmental reasons. 
Recently, there has been rapid technological development of in-steam tidal turbines. These devices 
act much like wind turbines, with individual turbines placed in regions of strong tidal currents. 
Individual turbines can be up to 20 m in diameter and can produce over 1 MW of power. Ideally, 
these turbines would produce a predictable and renewable source of power with less of an impact 
on the environment than a dam. KMLH examined the power potential of farms of such turbines 
across the Minas Passage (Figure [TJ^b)) where the tidal currents are strongest. They found that 
the potential extractable power is much higher than previous estimates and that the environmental 
impacts of extracting power can be greatly reduced by extracting only a portion of the maximum 
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power available. The simulations in KMLH did not represent individual turbines and left open the 
question of how to optimally place turbines. In this paper, we emulate the KMLH numerical model 
to examine the placement of turbines to maximize the power output. 
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(a) The upper Bay of Fundy 



(b) The Minas Passage 



Figure 1: Figure (a) shows the triangular grid used in the FVCOM model for simulating tides in 
the upper Bay of Fundy. The small box in the center surrounds the Minas Passage shown in (b). 
The shaded triangles in the center of (b) represent a modeled turbine. 

We numerically simulate the tides as in KMLH by essentially solving the 2D shallow water 
equations using the Finite- Volume Coastal Ocean Model (FVCOM) with a triangular grid on the 



upper Bay of Fundy (see Figure 1(a)). Since the grid triangles differ in size and orientation, the 
i-th turbine was modeled on the set of all triangular elements whose center lie within 250 m of 



{xi,yi). A typical turbine is shown in (see Figure 1(b) ). The triangular grid was developed by David 



Greenberg and colleagues at the Bedford Institute of Oceanography, NS, Canada. The details of 
FVCOM can be found in Chen et al. (2006). 

Using this set up, the estimate of the electric power that can be harnessed through a turbine 
at a particular grid location (xj,yj) over a tidal cycle T = 12.42 hours is obtained by the simulator 
in KMLH. The average tidal power at the grid location {xi,yi) is given by 

T 



P{xi,yi) = —J P(t;xi,yi)dt, 

where P{t;xi,yi) is the extractable power output at time t and location {xi,yi). The process is 
deterministic up to the machine precision, and the main objective is to emulate P{x,y). 



It turns out that the GP model fitted to the simulator output at n = 100 points (chosen using 
a space-filling design criterion) with the popular approach is not an interpolator and results in 
over-smoothed emulator (see Example 4 for details). This is undesirable as the ocean modelers are 
interested in the emulator that interpolates their simulator. This emulator will be used to obtain 
estimates of both the maximizer of the power function (i.e., the location where to put the turbine) 
and the extractable power at such location. Since the manufacturing and installation cost of each 
turbine is roughly 20 million dollars, a good approximation of the attainable power function can 
save the cost of a few turbines when targeting for a fixed amount of power. Example 4 shows that 
the proposed approach leads to more accurate estimate of the maximum extractable power. 

3 Background review 
3.1 Gaussian process model 

Let the i-th input and output of the computer simulator be denoted by a d-dimensional vector, 
Xi = {xii, ...,Xid), and the univariate response, in = y{xi), respectively. The n x d experiment 
design matrix, X, is the matrix of the input trials. The outputs of the simulation trials are held in 
the n-dimensional vector Y = y{X) = {yi,y2, ■ ■ ■ ,yn)' ■ The simulator output, y{xi), is modeled as 

y{xi) = iJ. + z{xi); i = l,...,n, (1) 

where ^ is the overall mean, and z{xi) is a GP with E{z{xi)) = 0, Var{z{xi)) = a^, and 
Cov{z{xi), z{xj)) = a^Rij. In general, y{X) has multivariate normal distribution, A^„,(ln;U, S), 
where S = cr^i?, and is a n x 1 vector of all ones (see Sacks et al. 1989, and Jones et al. 1998 for 
details). Although there are several choices for the correlation function, we focus on the Gaussian 
correlation because of its properties like smoothness (or differentiability in mean square sense) and 
popularity in other areas like machine learning (radial basis kernels) and geostatistics (kriging). For 
a detailed discussion on correlation functions see Stein (1999), Santner, Williams and Notz (2003), 
and Rasmussen and Williams (2006). The Gaussian correlation function is a special case {pk = 2 

for all k) ol the power exponential correlation family 

d 

Rij = coTT{z{xi),z{xj)) = ^ex.Y){-ek\xik - Xjk\^''} , for all (2) 

k=l 

where 9 = {9i, ...^Oii) is the vector of hyper-parameters, and pk € (0,2] is the smoothness param- 
eter. As discussed in Section 7, the results developed in this paper may slightly vary when other 
correlation structures are used instead of the Gaussian correlation. 
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We use the GP model with Gaussian correlation function to predict responses at any non- 
sampled design point x* . Following the maximum likelihood approach, the best linear unbiased 
predictor (BLUP) at x* is 



R-'Y, (3) 



y{x*) = fi + r'R-\Y -l^fL) 
with mean squared error 

s^{x*) = al{l-2C'r + C'RC) (4) 



where r = (ri(2;*), r„(x*))', rj(x*) = corr(z(x*), and C is such that y{x*) = C'Y. In 

practice, the parameters fx, and 6 are replaced with the estimates (see Sacks et al. 1989, Santner, 
Wilhams and Notz 2003, for details). 

3.2 A computational issue in model fitting 

Fitting a GP model ([1]) - Q to a data set with n observations in d-dimensional input space requires 
numerous evaluations of the log-likelihood function for several realizations of the parameter vector 
{6i, ...,9^; fi,af). The closed form estimators of and o"^, given by 

m = (l„'ii-'l„)-(l„'fl-'y) and = (^-lnAW)-fi-'(V-lnAW)^ 

n 

are often used to obtain the profile log-likelihood 

- 21ogLp cc log{\R\) + nlog[(y - l^mYR^HY " InAW)], (6) 

for estimating the hyper-parameters 6 = {9i, ...,9d), where \R\ denotes the determinant of R. Recall 
from d^l), that the correlation matrix R depends on 9 and the design points. 

An n X n matrix R is said to be near-singular (or, ill-conditioned) if its condition number 
k{R) = \\R\\ ■ is too large (see Section 4 for details on "how large is large?"), where || ■ || 

denotes a matrix norm (we will use the L2-norm) . Although these correlation matrices are positive 
definite by definition, computation of \R\ and R^^ can sometimes be unstable due to ill-conditioning. 
This prohibits precise computation of the likelihood and hence the parameter estimates. 

Ill-conditioning of R often occurs if any pair of design points are very close in the input space, 
or 0fc's are close to zero, i.e., Ylk=i ^k\^ik ~ ^jkl^'' ~ 0. The distances between neighboring points 
in space-filling designs with large n (sample size) and small d (input dimension) can be very small. 



Near-singularity is more common in the sequential design setup (e.g., expected improvement based 
designs, see Jones et al. 1998; Schonlau et al. 1998; Oakley 2004; Huang et al. 2006; Ranjan et 
al. 2008; Taddy et al. 2009), where the follow-up points tend to "pile up" near the pre-specified 
features of interest like the global maximum, contours, quantiles, and so on. 

3.3 The popular approach 

A popular approach to overcome near-singularity of R is to introduce a nugget, < 5 < 1, in 
the GP model by replacing R with a non-singular (well-behaved) Rs = R + 61. Or equivalently, 
introduce an independent white noise process in the model 

y{xi) = fi + z{xi) + €i, i = l,...,n, 

where are i.i.d. N{0,a^). That is, Var(Y) = a^^R + a^I = al{R + 51) for 6 = crf/cr^. Note 
that 6 > 1 represents more numerical uncertainty than the process uncertainty, which is certainly 
undesirable from a statistical perspective. The resulting predictor (also the BLUP) is given by 

' {1 - r'{R + 61)-^ Ir 



and the associated mean squared error s^^{x) is 



-I'+r' 



{R + 6ir'Y, (7) 



sl{x) = al{l-2C'sr + C'sRCs), (8) 

where Cs is such that ys{x) = C'gY. 

Theoretically, it is straightforward to see that the use of a positive nugget in the GP model 
produces a non-interpolator. Jones et al. (1998) show that the GP fit given by ([3|) and @ 
is an interpolator because for 1 < i < n, r'R^^ = e'p where Cj is the j-th unit vector, r = 
(ri(xj), r„(a::j))' and ri{xj) = corr(z(xj), z(xj)). If we use a 5(> 0) in the model (i.e., replace R 
with Rs), then r'Rj^ ^ e'- and thus y{xj) ^ yj and s^(xj) ^ 0. From a practitioner's viewpoint, 
one could sacrifice the exact interpolation if the prediction accuracy of the fit is within the desired 
tolerance, but it is not always achievable (see Section [6] for illustrations). 

The nugget parameter 6 is often estimated along with the other model parameters. However, one 
of the major concerns in the optimization is that the likelihood (modified by replacing R with Rs) 
computation fails if the candidate nugget 5 € (0, 1) is not large enough to overcome ill-conditioning 
of Rs- To avoid this problem in the optimization, it is common to fix an ad-hoc boundary value on 
the nugget parameter. The resulting maximum likelihood estimate is often close to this boundary 
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value and the fit is not an interpolator of the observed data (i.e., the prediction error is more than 
the desired tolerance). Even if the estimated nugget is not near the boundary, the use of a nugget 
in the model in this manner may introduce unnecessary over-smoothing from a practical standpoint 
(Section [6] presents several illustrations). In the next section, we propose a lower bound on the 
nugget that minimizes the unnecessary over-smoothing. 

4 Choosing the Nugget 

Recall from Section 3.2 that an n x n matrix R is said to be ill-conditioned or near-singular if its 
condition number k{R) is too large. Thus, we intend to find 6 such that k{Rs) is smaller than 
a certain threshold. Our main objective here is to compute the condition number of Rs and the 
threshold that classifies Rs as well-behaved. 

Let Ai < A2 < • • • < A„ be the eigenvalues of R then, in the L2-norm, k,{R) = A„/Ai (Golub 
and Van Loan 1996). The addition of 6 along the main diagonal of R shifts all of the eigenvalues of 
R by 6. That is, the eigenvalues oi Rs = R + dl are Xi + 6, i = 1, n, where Aj is the i-th smallest 
eigenvalue of R. Thus, Rs is well-conditioned if 



loginiRs)) < a 
e' 



An + ^ < 



Ai + <5 

> An(K(i;)-e") 
^ - K{R){e^-l) =^^'' 

where k{R) = A„/Ai and e"" is the desired threshold for k,{Rs). Note that 5ib is a function of the 
design points and the hyper-parameter 6. 

The closed form expressions for the eigenvalues and hence the condition number of a Gaussian 
correlation matrix R, in ([2]), for arbitrary 9 and design {xi, ...,Xn} is yet unknown as of our 
knowledge. If x G (—00, 00)'^ and ~ A^(0, a"^), closed form expressions of the expected eigenvalues 
of R are known (see Section 4.3 in Rasmussen and Williams 2006). In our case, x G [0, 1]"^, and the 
design points are often chosen using a space-filling criterion (e.g., Latin hypercube with properties 
like maximin distance, minimum correlation, OA; uniform designs, and so on). In such cases, 
at the most, one may assume x^ ~ U{0,1) for k = l,...,d. In fact, the objectives of building 
efficient emulators for computer simulators often include estimating pre-specified process features 
of interest, and sequential designs (e.g., expected improvement based designs) are preferred to 
achieve such goals. In such designs, the follow-up points tend to "pile up" near the feature of 
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interest. The distributions of such design points is not uniform and can be non-trivial to represent 
in analytical expressions. In conclusion, it is almost infeasible to obtain the closed form expressions 
for the eigenvalues of such R. Nonetheless, one can compute these quantities numerically. We 
use Matlab's built-in function eig.m to compute the maximum eigenvalue of R and cond.m to 
approximate the condition number k{R) = A„/Ai in the expression of Sn,. 

Another important component of the proposed lower bound is the threshold for getting well- 
behaved non-singular correlation matrices. As one would suspect, the near-singularity of such a 
correlation matrix depends on n, d, the distribution of {xi, ...,Xn} € [0,1]*^ and € (0,oo)'^. We 
now present the key steps of the simulation algorithm used for estimating the threshold under a 
specific design framework. The results are averaged over the distribution of {xi, and 9, and 

thus it is sufficient to find the threshold of k.{R). 

For several combinations of n and d we generate 5000 correlation matrices with randomly 
chosen 9 and {xi, ...,Xn} under the maximin Latin hypercube design scheme (Stein 1987), and then 
compute the proportion of matrices that are near-singular (see the contours in the left panel of 
Figure [2]). We used Matlab's built-in function Ihsdesign.m to generate the design points and chol.m 
to check whether or not a matrix R was near-singular under the working precision. 

For a positive definite well-behaved matrix R, "\U,p\ = chol{R)" produces an upper 
triangular matrix U satisfying U'U = R and p is zero. If R is not positive definite, 
then p is a positive integer. 

We also computed the condition numbers of these 5000 correlation matrices (using Matlab's built-in 
function cond.m). The right panel of Figure [2] presents the contours of the average of log(K(i?)) for 
different combinations of n and d. 

From Figure [21 it is clear that a ~ 25 can be used as the threshold for log{K{Rs)) of a well- 
behaved correlation matrix R^. Also note that the proportion of near-singular cases, denoted 
by the contours in the left panel of Figure decreases rapidly with the increment in the input 
dimension. This is somewhat intuitive because the volume of the void (or unexplored region) 
increases exponentially with the dimension, and a really large space-filling design is needed to 
jeopardize the conditioning of the correlation matrices in high dimensional input space. For other 
design schemes, one can follow these steps to estimate the threshold for the condition number of 
well-behaved correlation matrices. 

The lower bound on the nugget is only a sufficient condition and not a necessary one for Rs 
to be well-conditioned. For instance, a correlation matrix with 100 designs points in (0, 1)^ chosen 
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Figure 2: The contours in the left panel shows the proportion of correlation matrices flagged as 
near-singular. The contours in the right panel displays average log{K.{R)) values. The shaded region 
in the left panel corresponds to log(«:(i2)) > 25. 

using a space-filling criterion may lead to a well-behaved i? if is very large. If the correlation 
matrix is well-conditioned, R should be used instead of Rs, i.e., 



That is, when R is well-behaved our approach allows (5/6 to be zero and hence a more accurate 
surrogate can be obtained as compared to the popular approach (Section 3.3) where a non-zero 
nugget is forced in the model which may lead to undesirable over-smoothing. This could be of 
concern in high dimensional input space, because the proportion of near-singular cases decreases 
with the increment in the input dimension. Example 3 demonstrates the performance of the 
proposed methodology over the popular approach for an eight-dimensional simulator. 

Although the use of 6ih in the GP model minimizes the over-smoothing, Sn, may not be small 
enough to achieve the desired prediction accuracy (see Examples 1 and 2 for illustrations), and 
choosing 6 < 6ib may lead to ill-conditioned R. This may not be a big issue if one believes that the 
simulator is somewhat noisy and / or the statistical emulator is biased due to miss-specification in the 
correlation structure or model assumptions. In such cases, a little smoothing might be a good idea. 
However, controlling the amount of smoothing is a non-trivial task and requires more attention. 
On the other hand, over-smoothing is undesirable if the experimenter believes that the computer 




(9) 
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simulator is deterministic and the statistician is confident about the choice of the emulator (we 
consider the GP model with Gaussian correlation structure). Under these assumptions, we now 
propose a new predictor that can achieve the desired level of prediction accuracy. 



5 New Iterative Approach 

In this section, we propose a predictor that is based on the iterative use of a nugget 5 € (0, 1). This 
approach does not depend severely on the magnitude of the nugget, and the results developed here 
are based on an arbitrary < 6 < 1 large enough to ensure Rs well-behaved. However, choosing 

6 > 5ib may require more iterations to attain the desired prediction accuracy, and we recommend 
using 6ib. We also show that the proposed predictor converges to an interpolator ([SD and (jH). 

Recall that the key problem here is the inaccurate computation of \R\ and due to ill- 
conditioning of R. The main idea of the new approach is to rewrite the profile log-likelihood as 



and replace the ill-conditioned R~ with a well-behaved quantity. This modified profile log- 
likelihood can then be optimized to get the parameter estimates. Next, we describe how to find 
the appropriate well-behaved substitute for R~^. 

In the same spirit as the popular approach, we attempt to evaluate R~^w by solving Rt = w, 
under the assumption that R cannot be inverted accurately (i.e., R is near-singular) while there 
exists a (5 € (0, 1) such that Rs = {R + 6I) is well-conditioned. In an attempt to find an interpolator 
of the simulator (up to certain accuracy), our objective is to find t* = f{Rs,w) that is a better 
approximation of t = R~^w as compared to t = RJ^w, suggested by the popular approach. To 
achieve this goal, we propose to use iterative regularization (e.g., Tikhonov 1963, Neumaier 1998), 
a technique for solving ill-conditioned systems of equations. 

Let So = w and Si,i = 1,...,M, be a sequence of vectors obtained by recursively solving the 
system of equations given by 



Then, the estimate of t = R after the i-th iteration (1 < i < M) of regularization is given by 



21ogLp oc -log(|i?-^|) + nlog[(y - 1„AW)'^"'(>^ " WW)] 



(10) 



(i? -I- 5I)si = 5si-i. 



(11) 




(12) 
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where to is a vector of zeros. The final solution with M iterations of regularization, 

M 

tM = Y.d''-HR + SI)-''w, 

k=l 

requires only one direct inversion (or one Cholesky decomposition) of ii^ = R + 61, followed by 
M forward and backward substitutions. The proposed approximation of t = R^^w is Im for 
M > 1 chosen to satisfy the prediction accuracy requirement. Lemma [T] shows that the iterative 
regularization approach in (|lip and (|12p leads to a solution that is a generalization of the popular 
approach outlined in Section 3.3. 

Lemma 1 Let R be a n x n positive definite correlation matrix, I be the n x n identity matrix, 
and < 5 < 1 be a constant, then 



R^^ = ^5''^\R + 6iy 



J yiL -r ui j 

k=l 

The convergence of this infinite series follows from the von- Neumann series (the matrix version 
of the Taylor series, Lebedev 1997) expansion of g{u) = {R + ul)^^ around u = 6: 

g{u) = g{5) + {u-5)g'{5)+^^^^^g"{5)+^^^^^g"'{5) + ... , 
i.e., {R + uiy^ = {R + Siy^ + {u-6){-l){R + 5iy^ 

+ ^^^^{-l){-2){R + 6I)-' + --- 
= {R + 61)-^ -{u-6){R + 61)-^ + {u- 6f{R + SI)' 



-3 



Setting n = 0, we get R^^ = YlT=i ^'''^{R + 5I)~'^ and thus the proposed solution obtained using 
the iterative regularization is the M-th order von-Neumann approximation of R~^ . The predictor 

in the popular approach ([7D uses ti, the first order von-Neumann approximation, and hence our 
proposed approach is a generalization of the popular approach. 

To implement the proposed regularization in model fitting, the modified profile log-likelihood 
to be optimized is 

- 21ogLp oc -log(|ii,J,|) + nlog[(y - l^m)' RShi^ " ln/i(e))], (13) 

where R^l^ = 'ZkU 6^-^{R + SI)"^. Closed form expressions for jl(6) and <7^(^) are the same as 
in ([5]) subject to R~^ replaced by Rj\f- The new regularized predictor ys^M^x) at x € x is 



ys,M{x) 



1' /?~^ 1 
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and the corresponding MSE s'jj^j{x) is given by 

4M^) = ^'(1 - + C's^mRCsm)^ (15) 

where Cs^m is such that ys^M{x) = C'^j^jY. Lemmas [2] and [3] estabhsh the convergence results for 
an arbitrary < 6 < 1. 

Lemma 2 Let R be a near-singular correlation matrix as defined in and < 6 < 1 be a nugget 
such that R + SI is well-behaved. Then, for every x* G x = [0, l]'^, 

lim y5,M{x*) = y{x*), 

M->oo 

where y{x*) and ys,Aiix*) are defined in (OP and respectively. 

The proof follows from Lemma [1] and using lim RjIj = in (jl4p . It is straightforward to 
show that Cs^M in (|15|) converges to C in ([U as M — ?• oo. This also proves the next result on the 
convergence of the mean squared error for the proposed predictor. 

Lemma 3 Let R be a near-singular correlation matrix as defined in and < 6 < 1 be a nugget 
such that R + 51 is well-behaved. Then, for every x* G x = [0, l]'^, 

where s{x*) and ss^m{x*) are defined in ^ and i fTC]) respectively. 

Lemmas [2] and [3] prove that even if a few pairs of points are too close together in the input space, 
or 9k s are close to zero to cause near-singularity of R, the proposed iterative predictor converges 
to an interpolator as M increases (i.e., for 1 < i < n, ys^Mi^i) — > yi and s|jvf(^«) as M ^ oo). 

Remark: Li practice, when a pre-specified prediction accuracy is desired, the proposed iterative 
approach suggests refitting the GP model (i.e., optimization of (jl3p ) for different choices of M > 1. 
Note that the parameter estimates change with M which allows for the extra flexibility in the 
model that adjusts the over-smoothed portion of the surrogate. First of all, the computational 
cost of fitting this model increases with M. Secondly, the combined cost of refitting the model for 
different values of M can be quite large. Although the numerical stability in computing Rj\.f does 
not change with M, computation of l-R^^j^jl can become less numerically stable with increasing M. 
This is because Rj\,f — > R~^ as M ^ oo and the computation of |-R~^| is assumed to unstable. 
Considering these issues, we recommend optimizing the profile log- likelihood (|13p with M = 1 to 
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obtain 6mie and then use it to compute ys,Mix) and ss,A'i{x) for any M > 1 by following the iterative 
regularization steps outlined above. 



The convergence results in Lemmas (TJ [2] and [3] do not depend on the choice of 9 and 5 in 
Rs = R + SI, and so the predictor obtained is still an interpolator. The key steps required for the 
implementation of the proposed approach are as follows: 

1. Computation of the profile log-likelihood (jlOp for the estimation of 9. 

(a) Choose a candidate 9 in Q'^ and compute R. 

(b) Compute the lower bound of nugget 6ib in (9). Note that 5/6 is a function of the hyper- 
parameters 9, the design matrix and the threshold. 

(c) Replace R^^ with RJ^^^^ in the likelihood ([10]). 

2. Obtain the parameter estimates 9 and 5ib{9) by optimizing the profile log-likelihood. Then 
compute fi{9) and <5-^(^). 

3. Use the parameter estimates 9, 6ih{9), fi{9) and (7^(6') to compute the regularized emulator 
given by y<5,j,,M(a^) and m(^) ™ and ([T5|) respectively. 

The number of iterations (M) depends on the desired prediction accuracy, and one can build 
stopping rules for attaining the pre-specified accuracy in (|14p . The first quantity 



measures how close the approximation is to an exact interpolator. We define another quantity 

t , T.i=i\y5,k{^i) - ys,k-i{xi)\ 

that measures the relative improvement in the prediction by increasing the number of terms in 
the von-Neumann approximation. Lemmas [2] and [3] show that both and goes to — oo as k 
increases, however, as we will see in Example 1, the rates of convergence of and may differ. 
The convergence of means that the predictor is converging to the BLUP, but the convergence of 
implies that the predictor ys^k has stabilized. That is, both of these measures (^^ and ^fc) can 
be used in practice to choose appropriate M for achieving the desired prediction accuracy. The 
next section illustrates that even the best choice of 6 can lead to over-smooth emulators, and the 
iterative approach is advantageous. 
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6 Examples 



To illustrate the proposed approach we first present a few simulated examples. The performance 
of the new iterative predictor is also compared with the popular approach. Then, we revisit the 
tidal power modeling example. 

Example 1 Let xi,X2 € [0,1], and the underlying deterministic simulator output be generated 
using the GoldPrice function (Andre, Siarry and Dognon 2000), 



For illustration purpose, we intentionally select a maximin Latin hypercube design (Stein 1987) 
with n = 70 points that leads to an ill-conditioned correlation matrix for small 9 € (0, oo)^. It 
turns out that for this particular design (see Figure 3), the correlation matrix R is ill-conditioned 
if ^1 • ^2 ^ 2.5. Figure O^a) presents the contours (at heights y = 120 and 10'^) of the true simulator 
(solid curve) and the GP surrogate fit (obtained using the methodology outlined in Sections 3.1 
and 3.3). For successful implementation of the popular approach, we optimized the likelihood in 
the parameter space 6 G (10~^,1) and 9 € (0, oo)^. The parameter estimates for the GP fit are 
^mie = 1-05 • 10~^ and 6mie = (4.41,8.54). Note that 6mie is close to the boundary and the fitted 
surrogate is significantly different than the reality in the central part of the input space. 

When the proposed method (as outlined in Sections 4 and 5) was used to fit the GP model, the 
parameter estimates were Omie = (2.64,4.03) and 6ih{9mie) = 4.63 • 10~^°. The fitted surrogate for 
M = 1, in Figure [3|^b), shows a much better fit. Moreover, the iterative approach leads to further 
improvement on the fitted GP surrogate (see Figure [3|^c) and [3||d) ) . The overall convergence of 
and is shown in Figure [H Note that converges to — oo at a faster rate as compared to 



Table [T] summarizes the comparison results from a more detailed simulation study based on 
several combinations of run-sizes of the space-filling design and the boundary values of 6 in the 
likelihood optimization. For fair comparison between the two methodologies, first we use the 
proposed model with only one term in von-Neumann approximation (i.e., M = 1) and the popular 
method with 6mie- Then we increase the number of iterations to measure the performance of the 
iterative approach in improving the prediction accuracy. The results in Table [1] are summarized over 
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(a) Popular fit with MLE 



(b) M = 1 (witli lower bound) 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 



(c) M = 5 (with lower bound) 



(d) M = 20 (with lower bound) 



Figure 3: The dots denote the design points, the sohd curves denote the contours of the true 
GoldPrice function, and the dashed curves represent the contours of the predicted surfaces. 




Figure 4: Convergence of (dashed curve - right axis) and (sohd curve - left axis). 



model fits with 50 random maximin Latin hypercube designs. The table entries are P50 (P5,P95), 



where Pr denotes the r-th percentile of values obtained from the model fits. 



16 



Table 1: Median and (i-5,Pg5) of values for the proposed approach (denoted by and the 
popular approach (denoted by "mZe") applied to the GoldPrice function. 





n = 25 


n = 50 


n = 75 


n = 100 


l^u > 10-' 

Smle > 10-"' 


-1.96 (-3.43, -0.58) 
-2.47 (-4.91, -0.91) 


-2.32 (-2.56, -2.06) 
-2.81 (-4.85, -1.90) 


-2.21 (-2.36, -2.06) 
-3.24 (-4.07, -2.56) 


-2.25 (-2.38, -2.13) 
-3.58 (-4.07, -3.17) 


5,b{M = 1) 

5w{M = 5) 
5ibiM = 20) 


-14.01 (-15.08, -12.22) 


-11.30 (-12.44, -10.57) 


-3.50 (-3.78, -3.01) 
-3.91 (-4.27, -3.37) 
-4.29 (-4.63, -3.70) 


-3.21 (-3.37, -3.08) 
-3.64 (-3.82, -3.50) 
-3.91 (-4.09, -3.71) 



Prom Table [H it is clear that the relative prediction error of the surrogates fitted using the 
popular approach decreases by lowering the boundary value of the nugget parameter (from 10"^ 
to 10~^^) in the optimization problem. It turns out that the correlation matrices are well-behaved 
for small designs (i.e., n = 25 and n = 50) and non zero nuggets are not required for numerically 
stable model fitting process. This is captured by the proposed approach, as 6ib{6mie) = and the 
relative prediction error is much smaller than in the popular approach where a non-zero nugget is 
forced in the model. Consequently, the iterative approach is not used for these cases. For n = 75 
and n = 100, the correlation matrices turn out to near-singular for 6 near 6mie, and non-zero 6 
had to be used for numerically stable computation. It is clear from the last three rows that the 
proposed iterative approach leads to improvement in the prediction accuracy. 

Example 2 Suppose the deterministic simulator outputs are generated using the three dimensional 
Perm function (Yang 2010) given by 

2 



k=l 



i=l 



where x = (xi,X2,X3) and the i-th input Xi € [—3,3] for i = 1, ...,3. For convenience, we re-scale 
the input variables in [0,1]. As in the previous example, we fit the GP model using both the 
popular method (Section 3.3) and the proposed approach (Sections 4 and 5) to the data generated 
by evaluating the Perm function at n design points chosen using maximin Latin hypercube scheme. 
Table [2] compares the median and the two tail percentiles (P5, P95) of values obtained from fitting 
GP models to 50 randomly generated data sets of different run-sizes. Here also, we pre-specified 
the boundary values for estimating 5 in the popular approach. 

As in Example 2, the relative prediction errors of the GP fits obtained through the popular 
approach is slightly reduced by lowering the boundary value of 6 (from 10"^ to lO"^*') in the 
optimization process. The small values of the percentiles of in the row labelled ^^6ib (M = 1)" 
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Table 2: Median and (i-5,Pg5) of values for the proposed approach (denoted by and the 
popular approach (denoted by "m/e") apphed to the perm function. 





n = 25 


n = 50 


n = 75 


n = 100 


> IC 
Sr^le > 10-"' 


-0.99 (-3.27, -0.60) 
-1.06 (-3.73, -0.64) 


-1.87 (-2.67, -1.48) 
-1.89 (-3.13, -1.57) 


-1.76 (-2.35, -1.57) 
-1.76 (-2.51, -1.61) 


-1.83 (-2.25, -1.62) 
-1.84 (-2.68, -1.63) 


Sli{M = 1) 
5lb{M = 5) 


-14.58 (-15.06, -14.04) 


-12.88 (-13.43, -12.05) 


-12.05 (-12.56, -11.44) 


-1.79 (-12.86, -1.64) 
-11.70 (-12.82, -1.71) 



of Table El indicates that the correlation matrices are well-behaved (i.e., Sib{0mie) = 0) for most 
of the designs with runs-sizes n = 25, 50 and 75, and at least 5% of the designs of size n = 100. 
When the number of iterations is increased to M = 5, the prediction accuracy of the model fits has 
significantly increased and at least 50% of the designs of size n = 100 indicates Sih{9mie) = 0. Note 
that the proposed approach facilitates the inclusion of a non-zero nugget (the smallest possible 
6 needed) only when required for fixing the ill-conditioning of a correlation matrix. Clearly, the 
number of realizations that required a non-zero nugget in the GP models here is much smaller 
than in the GoldPrice example. This is expected because getting near-singular correlation matrices 
becomes less likely as the dimensionality of the input space increases. 

Example 3 The borehole model is a more realistic deterministic simulator, that models the flow 
rate through a borehole which is drilled from the ground surface through two aquifers, and is 
commonly used in computer experiments (e.g., Joseph, Hung and Sudjianto 2008) to compare 
different methods. The flow rate is given by 

27rTuiHu - Hi) 



fix) 



1 + 



2LT„ 



I Til 

^ Ti 



ln{r / r^y^K^ 

where x = {r^^r, Tu,Ti, Hu, Hi,L, Kw), and the input € [0.05, 0.15] is the radius of the borehole, 
r € [100,50000] is the radius of the influence, G [63070,115600] is the transmissivity of the 
upper aquifer, T/ G [63.1,116] is the transmissivity of the lower aquifer, G [990,1110] is the 
potentiometric head of the upper aquifer. Hi G [700, 820] is the potentiometric head of the lower 
aquifer, L G [1120, 1680] is the length of the borehole and G [9855, 12045] is the hydraulic 
conductivity of the borehole. For convenience, we re-scale the input variables in [0, 1]. 

Table [3] compares the median and two tail percentiles (P5, P95) of values obtained from the 
GP model surrogates fitted to 50 random maximin Latin hypercube designs via the two methods. 
Since the simulator is 8-dimensional, we considered slightly larger run-sizes n = 50, 75, 100 and 
125, however, the number of simulations and the candidates for the boundary values of 6 in the 
likelihood optimization were kept the same as in Examples 1 and 2. 
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Table 3: Median and (i-5,Pg5) of values for the proposed approach (denoted by and the 
popular approach (denoted by "m/e") applied to the borehole model. 





n = 50 


n = 75 


n = 100 


n = 125 


Smle > 10 " 
Smle > 10-'° 


-2.86 (-3.09, -2.56) 
-3.23 (-4.38, -2.69) 


-2.91 (-3.15, -2.71) 
-3.37 (-4.16, -2.94) 


-2.97 (-3.16, -2.77) 
-3.51 (-5.00, -3.05) 


-3.02 (-3.20, -2.86) 
-3.53 (-4.58, -3.17) 


5ib{M = 1) 


-12.31 (-12.75, -11.69) 


-11.56 (-11.95, -10.87) 


-10.85 (-11.31, -10.07) 


-11.27 (-11.79, -10.83) 



As expected, the relative prediction error of the GP fits obtained using the popular approach 
slightly decreases by lowering the boundary value of the nugget parameter (from 10~^ to 10~^^) 
in the optimization problem. The proposed method leads to predictors with significantly higher 
prediction accuracy. The percentiles of in the last row of Table [3] also suggest that most of the 
correlation matrices are well-conditioned for the 6 values near Omiei a-nd Sif,{9mie) = 0. That is, the 
iterative approch is not needed to further improve the prediction accuracy. 

Example 4 We now revisit the tidal power example in Section 2. The computer simulator (a 
version of FVCOM) is expensive and cannot be evaluated at numerous coordinates. Each of the runs 
presented here required approximately one hour to run on 4 processors in parallel on the Atlantic 
Computational Excellence network (ACEnet) mahone cluster. While this is not particularly onerous 
on a large cluster, the grid resolution used in KMLH is about 200 m (length of a side in a triangle). 
A realistic model of 20 m sided triangular grid and with 10 vertical layers to model 3D flow 
would increase the computational expense by a factor of 5120, making each individual simulator 
run roughly 10 times more costly than the generation of the entire data set examined here. The 
ocean modelers believe that the simulator is deterministic up to the machine precision and they 
are interested in the emulator that interpolates the simulator. 

A total of 533 runs (on a 13 x 41 grid) were used to obtain the data displayed in Figure [5l We 
use this data to compare our results. The goal is to build an emulator of the computer model using 
a fraction of the budget (533 runs) that provides the best approximation of the simulator. 

We used a maximin based coverage design (Johnson, Moore and Ylvisaker 1990) to choose a 
subset of n = 100 points from these 533 points to constitute a space-filling design. The contours 
from both the predicted surface and the true simulator (based on the 13 x 41 grid) are shown in 
Figure m For successful implementation of the popular approach outlined in Sections 3.1 and 3.3, 
the likelihood optimization took place in the parameter space 6 € (10~^, 1) and 9 € (0, oo)^, and the 
parameter estimates for the GP fit are Omie = (163.18,50.66) and 5mie = 0.0462 (see Figure 6(a)). 
The parameter estimates for the GP model fitted using the proposed approach with M = 1 are 
Omie ~ (788.54,221.18) and 6ib{B^ie) ~ (see Figure 6(b)). 
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-64.5 -64.48 -64.46 -64.44 -64.42 -64.4 -64.38 -64.36 -64.34 



Figure 5: FVCOM outputs (average extractable power) over a coarse grid in the Minas Passage. 



0.8r • • 0.8 




X, X, 

(a) Popular fit with MLE (b) A/ = 1 (with lower bound) 



Figure 6: Dots denote the design points, the solid curves denote the contours for the true simulator 
based on the 13 x 41 grid, the dashed curves show the contours of the GP fits. 

Figure [6] shows that the GP based emulator obtained using the proposed approach (Figure [6jb)) 
is less smooth as compared to that obtained via the popular approach (Figure [6|^a)). The relative 
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prediction errors for the GP fits with the popular method and the proposed approach are = —1.06 
and —15.18 respectively. That is, the proposed approach leads to significantly better approximation 
of the iterpolator of the simulator. Moreover, when using the popular approach, the maximum 
predicted power obtained by evaluating max{y(x),x G x} is approximately 1.4 • 10^ W with the 
maximizer being (0.7850, 0.4500), whereas if we use the proposed approach the maximum predicted 
power IS approximately 1.6 • 10^ W observed at (0.7900, 0.4500). 

7 Discussion 

Assuming that the underlying computer simulator is deterministic up to the machine precision 
and the statistician is confident about the choice of the simulator (i.e., GP model with Gaussian 
correlation), fitting the model to a data set with n points in d-dimensional input space requires 
computation of the determinant and inverse of n x n correlation matrices for several 9 values. In 
Section HI we conducted a simulation study to explore space-filling designs (specifically maximin 
Latin hypercube designs) for different combinations of (n, d) that can lead to near-singular correla- 
tion matrices. In Section [5l we proposed an iterative approach, that is also a generalization of the 
popular approach, to construct a new predictor ys,M that has higher prediction accuracy compared 
to ys - the one with the popular approach. Lemmas 1, 2 and 3 show that ys,M converges to the 
BLUP as the number of iterations (M) increases. The lower bound 5ib, proposed in Section HI also 
allows us to use a non-zero nugget only when required, and minimizes the number of iterations 
when required (i.e., 5ib > 0) to reach the desired tolerance of prediction accuracy. 

There are a few important remarks worth noting. First, the methodology developed here can also 
be adapted in the Bayesian framework. For computing the posterior distribution of the parameters 
and the predictor, needs to be computed for several realizations of 9, and a nugget is often 
used to overcome the near-singularity problem (e.g., Taddy et al. 2009). The proposed lower bound 
5ih can be used for defining a prior for 5, i.e., the search should be limited to [5ih, 1), because 5 < 6ib 
may lead to near-singular cases. One can also use the iterative approach to further improve the 
prediction accuracy. 

Second, we used the squared exponential correlation (p^ = 2 for all k) in the GP model because 
of its popularity and good theoretical properties. It turns out the GP model with other power 
exponential correlation function (i.e., pk < 2) may lead to predictors with larger MSE and some- 
times worse fits as compared to that of the GP models with the Gaussian correlation function. 
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The ill-conditioning problem may also occur when other power exponential correlation functions 
(i.e., Pk < 2) are used. Recall that the near-singularity of R occurs because (a) at least two of the 
design points (say Xi and Xj) are close together in the input space, and/or (b) the hyper-parameters 
9k, k = 1, d are very close to zero, i.e., Ylk=i Gk\xk,i — Xkj\^'' ~ 0. This makes a few of the rows of 
R very similar, and will happen even if the power (pk) is less than two. A closer investigation reveals 
that even with pk < 2, near-singular cases occur very frequently in the sequential design setup. 
However, for the space-filling designs, it is rather fascinating that the occurrence of near-singular 
cases is substantially reduced by even a small reduction in the power from pk = 2 to pk = 1.99. 
We suspect this is due to the limiting behaviour of the Gaussian correlation in the family of power 
exponential correlation functions pk G (0,2]. 

In conclusion, when fitting a GP model to a data set obtained from a deterministic computer 
model if the correlation matrices are near-singular, we recommend using 5ib - the lower bound on 
the nugget, along with the iterative approach with the number of iterations, M, chosen according 
to the desired prediction accuracy. 
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